Replication archive for Jedwab, Remi and Adam Storeygard (2021) "The Average and Heterogeneous Effects of Transportation Investments: Evidence from sub-Saharan Africa 1960-2010", Journal of the European Economic Association.

Code and data in this archive can be used to replicate the results of Jedwab and Storeygard (2021). To replicate only the tables and graphs in the paper, only Stata is needed (version 16.1 was used for some portions): run 09createresults.do in Stata.

For the full replication from raw data, the following software is needed:
- Stata version 16.1 or higher
- ArcGIS Desktop 10.5.1 with Advanced (ArcInfo) license and the Spatial Analyst Extension (two manual steps noted in 08createmaindata.do were performed in MapInfo 7.8, but ArcGIS should work as well)
- Python 2.7.13 (we ran the .py scripts in IDLE; other methods may require editing)
- Matlab R2019b, running on a Linux cluster
- Slurm, running on a Linux cluster (the Matlab routines called by 05slurm.sh could alternatively be run on a desktop with enough memory, or on Linux without Slurm)
Versions shown are what we used. It is possible that other versions will work but we cannot be sure.

The archive is organized as follows:
- files.xlsx lists and describes all required raw input files
- The code folder contains all code needed to replicate results, run in order from 01 to 09.
- data/inputs contains all raw input data except restricted DHS data (see below). All such files are listed in files.xlsx.
- data/restrictedinputs is where restricted DHS data should be placed, once accessed.
- Intermediate data produced by the code, including the Stata files on which the regression code can be run, are in the data folder.
- Resulting tables and figures are in the output folder.

Restricted DHS data: Demographic and Health Survey (DHS) data cannot be distributed with this replication archive due to confidentiality restrictions. Users interested in replicating that portion of the analysis must request survey and GPS data for the following 16 countries at measuredhs.com: Benin, Burkina Faso, Cameroon, Cote d'Ivoire, Ethiopia, Ghana, Guinea, Kenya, Malawi, Mali, Namibia, Nigeria, Senegal, Tanzania, Uganda, Zimbabwe. Such a request requires a brief description of the intended use of the data, and is usually approved in considerably less than a week. Once approved for both survey and GPS data, download the files listed for the "Standard DHS" surveys in the years listed in files.xlsx (tab: "restricted_DHS_data") and place them in data\restrictedinputs\geo and data\restrictedinputs\women.
The DHS file BJGE33FL.dbf must then be saved as a comma-separated values (CSV) file BJGE33FL.csv before proceeding further.
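
One possible way to do the .dbf to CSV conversion without GIS or spreadsheet software is sketched below. It is not part of the archive and was not used by the authors. It is written for Python 3, assumes the file is a plain dBase III-style table, and writes every value as trimmed text. The function name dbf_to_csv is hypothetical. The column layout that 06dhsfertmort.do expects is not documented here, so the result should be checked against the script before use.

```python
import csv
import struct


def dbf_to_csv(dbf_path, csv_path):
    """Convert a dBase III-style .dbf table to CSV, keeping values as text.

    Returns the number of data rows written (deleted records are skipped).
    """
    with open(dbf_path, "rb") as f:
        header = f.read(32)
        # Bytes 4-7: record count; 8-9: header length; 10-11: record length.
        n_records, header_len, record_len = struct.unpack("<IHH", header[4:12])

        # Field descriptors are 32 bytes each and end at a 0x0D terminator.
        f.seek(0)
        head = f.read(header_len)
        fields = []
        pos = 32
        while pos + 32 <= header_len and head[pos:pos + 1] != b"\r":
            name = head[pos:pos + 11].split(b"\x00", 1)[0].decode("latin-1")
            length = head[pos + 16]  # field width in bytes
            fields.append((name, length))
            pos += 32

        f.seek(header_len)
        written = 0
        with open(csv_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow([name for name, _ in fields])
            for _ in range(n_records):
                record = f.read(record_len)
                if len(record) < record_len:
                    break  # truncated file
                if record[0:1] == b"*":
                    continue  # record flagged as deleted
                row = []
                offset = 1  # skip the deletion flag byte
                for _, length in fields:
                    raw = record[offset:offset + length]
                    row.append(raw.decode("latin-1").strip())
                    offset += length
                writer.writerow(row)
                written += 1
    return written
```

A call might look like dbf_to_csv("data/restrictedinputs/geo/BJGE33FL.dbf", "data/restrictedinputs/geo/BJGE33FL.csv"). Both the location of the .dbf and the placement of the CSV next to it are assumptions.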

Below is a detailed description of the inputs and outputs of each code file, including the proximate source of each input file:
01roadload.py
02stataload1.do
03othergeoload.py
04stataprepformatlab.do
05slurm.sh
06dhsfertmort.do
07statacombineall.do
08createmaindata.do
09createresults.do

01roadload.py
Purpose: load roads, railroads and borders for all countries and years
Inputs
	roads_<x>_polyline.shp, <x>={CS,NE,NW} - regional shapefiles of roads with surface information for all road-years (3 raw)
	Railroads_polyline.shp - railroads (raw)
	pol_africa_region42.shp - boundaries (raw)
	roads_polyline.shp - roads from Deichmann and Nelson (raw)
	prioritytab.csv - a conversion table from road classes to costs of traversing a square (raw)
	countrycodes.csv - list of countries with various codes and names (raw)
	regextents.csv - list of region extents (raw)
	provinces_1960_final_region.shp
	provinces_2010_final_region.shp
Outputs (262)
	road surface grids by region-year (64)
	road length grids by region-surface-year (192)
	railyearp.csv - grid points with year rail built
	isonv10.tsv - country identifier grid
	province1960.tsv - 1960 province identifier grid
	province2010.tsv - 2010 province identifier grid
	roadgrid2.gdb containing two small tables (container for later work)
	uweroadlengthtab.csv
	raillengthtab.csv
	
02stataload1.do
Purpose: load cities data and convert to grid
inputs:
	countrycodes.csv (raw)
	cities_09182015.xls (raw)
	ged191.zip (raw)
	isonv10.tsv (01roadload.py)
outputs:
	citygrid.dta
	citiesclean.dta
	countrycodes.dta
	samplepoints199814.csv
	ucdpged191_2000.csv
	ucdpged191_2010.csv

03othergeoload.py
inputs: 
	afpop<x>.tif, x={60,70,80,90,00} (5 raw)
	gpw-v4-population-count_<x>.tif, x={2000,2010} (2 raw)
	GREG.shp (raw)
	Murdock_EA_2011_vkZ.shp (raw)
	res03_crav6190l_sxlr_{0}.tif (raw)
	pol_africa_region42.shp (raw)
	countrycodes.csv (raw)
	F<x>.v4b_web.stable_lights.avg_vis.<y> for <x>={101992,142000,152000,182010} X <y>={tfw,tif.gz} (8 raw)
	samplepoints199814.csv (02stataload1.do)
	ucdpged191_2000.csv (02stataload1.do)
	ucdpged191_2010.csv (02stataload1.do)
outputs:
	poptab.csv
	cfldumtab2000.csv
	cfldumtab2010.csv
	nbtab.csv	
	F<x>.mean.tsv for <x>={101992,142000,152000,182010}

04stataprepformatlab.do
inputs:
	ports_remi_v6.xls (raw)
	regextents.csv (raw)
	mines.dta (raw)
	countrycodesworld.dta (raw)
	WDIandGDF_csv.zip (raw)
	gdp_data_desc_paper.dta (raw)
	isonv10.tsv (01roadload.py)
	province1960.tsv (01roadload.py)
	province2010.tsv (01roadload.py)
	railyearp.csv (01roadload.py)
	road surface grids by region-year (64; 01roadload.py)
	road length grids by region-surface-year (192; 01roadload.py)
	citygrid.dta (02stataload1.do)
	countrycodes.dta (02stataload1.do)
	poptab.csv (03othergeoload.py)
	cfldumtab2000.csv (03othergeoload.py)
	cfldumtab2010.csv (03othergeoload.py)
	F<x>.mean.tsv for <x>={101992,142000,152000,182010} (03othergeoload.py)
	nbtab.csv (03othergeoload.py)
	ports_1960.xls (raw)
	ports_2005.xls (raw)
	conflict.xls (raw)
	FDP2008.xls (raw)
	main_sample.dta (raw)
	popsources.xls (raw)
	mines_production.xls (raw)
	mines_years_v2 (raw)
	mines_still_missing_v1_VJ_RJ (raw)
	list_countries.xlsx (raw)
	Data_Extract_From_WDI_Database_Archives_(beta).xlsx (raw)
	WEO_SSA.xlsx (raw)
	WEO_MENA.xlsx (raw)
	maddison_data.dta (raw)
outputs:
	contgrid.csv - road and population info for all years, and neighbor ids
	contgrid.dta
	conflictdecades.dta
	cityneighborcountries.dta
	cxcylights.dta
	nbtabcosts.csv
	natgdptab.csv
	landsuit.dta
	geocontrols.dta
	greggrown.dta
	regcapgrid.dta
	distmatches.dta
	ports_1960.dta
	ports_2005.dta
	indep_war_new.dta
	refugees.dta
	mines.dta
	censuscountryperiod.dta
	badcountryperiod.dta
	gr_neighborcountries.dta
	
05slurm.sh
Purpose: calculates many variants of market access, and lengths of roads by ring-octant. Note that this script, all subscripts, and all inputs must be placed in the same Unix directory as 05slurm.sh, and the following two commands then run from that directory:
chmod 755 05slurm.sh
./05slurm.sh
inputs:
	timecost.csv (raw)
	contgrid.csv (04stataprepformatlab.do)
	nbtabcosts.csv (04stataprepformatlab.do)
	natgdptab.csv (04stataprepformatlab.do)
subscripts:
	cellcostcalc.m
	citycostcalc.m
	citycostcalctm30.m
	wrap202004.m
	wrapcell202004.m
	wraptm30.m
	ringoctantavgcostlengths.m
	repnan.m
outputs:
	market access csv files (1028)
	ring-octant length csv files (30)
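
Because 05slurm.sh expects its inputs and subscripts in the directory it is run from, a quick pre-flight check may save a failed cluster job. The sketch below is not part of the archive. It is written for Python 3 and simply tests for the file names listed above. The function name missing_for_slurm is hypothetical.

```python
import os

# File names copied from the 05slurm.sh section of this README.
REQUIRED = [
    "05slurm.sh",
    # inputs
    "timecost.csv",
    "contgrid.csv",
    "nbtabcosts.csv",
    "natgdptab.csv",
    # Matlab subscripts
    "cellcostcalc.m",
    "citycostcalc.m",
    "citycostcalctm30.m",
    "wrap202004.m",
    "wrapcell202004.m",
    "wraptm30.m",
    "ringoctantavgcostlengths.m",
    "repnan.m",
]


def missing_for_slurm(run_dir):
    """Return the required files that are absent from run_dir, in list order."""
    return [
        name for name in REQUIRED
        if not os.path.isfile(os.path.join(run_dir, name))
    ]
```

Calling missing_for_slurm(".") from the cluster directory should return an empty list before ./05slurm.sh is run.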

06dhsfertmort.do
Purpose: calculates city-level estimates of the rate of natural increase using DHS data.
inputs:
	WDI_csv20200603.zip - World Development Indicators (raw)
	<CC>IR<NN>FL.DTA, where CC is a country code and NN is a version code - DHS individual (women's) recode survey files (53 restricted raw)
	<CC>GE<NN>FL.dbf, where CC is a country code and NN is a version code - DHS cluster locations (53 restricted raw)
	BJGE33FL.csv - see above (raw)
	contgrid.dta (04stataprepformatlab.do)
outputs:
	imrcfr_cdrcbrni.dta (restricted)

07statacombineall.do
Purpose: Combines all Matlab output into three Stata files
inputs:
	market access csv files (1028; 05slurm.sh; these must be moved/copied from the Linux cluster to the data/matlabout folder)
	ring-octant length csv files (30; 05slurm.sh; these must be moved/copied from the Linux cluster to the data/matlabout folder)
outputs:
	citymp.dta
	cellmp.dta
	ringoctantlengthssize1new.dta
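
After copying the Matlab output from the cluster, it may be worth confirming that the transfer was complete before running 07statacombineall.do. The sketch below is not part of the archive. It is written for Python 3 and assumes that data/matlabout is a flat folder containing only the 1028 market access files and 30 ring-octant files, all with a lowercase .csv extension. The function names are hypothetical.

```python
import glob
import os

# Counts stated in this README for the 05slurm.sh outputs.
EXPECTED_MARKET_ACCESS = 1028
EXPECTED_RING_OCTANT = 30


def count_csv(folder):
    """Count files ending in .csv directly inside folder (no subfolders)."""
    return len(glob.glob(os.path.join(folder, "*.csv")))


def matlabout_complete(folder):
    """Return (is_complete, found, expected) for the Matlab output folder."""
    expected = EXPECTED_MARKET_ACCESS + EXPECTED_RING_OCTANT
    found = count_csv(folder)
    return found == expected, found, expected
```

For example, matlabout_complete("data/matlabout") returns a tuple whose first element is True only when exactly 1058 CSV files are present.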

08createmaindata.do
Purpose: combine files into regression files
inputs:
	imrcfr_cdrcbrni.dta (restricted; 06dhsfertmort.do)
	citymp.dta (07statacombineall.do)
	cellmp.dta (07statacombineall.do)
	ringoctantlengthssize1new.dta (07statacombineall.do)
	contgrid.dta (04stataprepformatlab.do)
	conflictdecades.dta (04stataprepformatlab.do)
	cityneighborcountries.dta (04stataprepformatlab.do)
	cxcylights.dta (04stataprepformatlab.do)
	landsuit.dta (04stataprepformatlab.do)
	geocontrols.dta (04stataprepformatlab.do)
	greggrown.dta (04stataprepformatlab.do)
	regcapgrid.dta (04stataprepformatlab.do)
	distmatches.dta (04stataprepformatlab.do)
	ports_1960.dta (04stataprepformatlab.do)
	ports_2005.dta (04stataprepformatlab.do)
	indep_war_new.dta (04stataprepformatlab.do)
	refugees.dta (04stataprepformatlab.do)
	mines.dta (04stataprepformatlab.do)
	censuscountryperiod.dta (04stataprepformatlab.do)
	badcountryperiod.dta (04stataprepformatlab.do)
	gr_neighborcountries.dta (04stataprepformatlab.do)
	city_mcid_provinces19602010.csv
	drought_st.csv (raw)
	ranks_1960_1970.xlsx (raw)
	coord_megacells_v2.csv (raw)
	ethnic_politics_data.xls (raw)
	colonynew.xlsx (raw)
	natpark.csv (raw)
	border_crossings_NW_2014_final.csv (raw)
	airports.csv (raw)
	regional_capitals_v10.xls (raw)
	top_5_cities_1960.xlsx (raw)
outputs:
	citypanel.dta
	panel_everpop_final_megacell<x>.dta for <x>={3,5,7,9} (4)
	panel_everpop_final_megacell<x>_notopreg60top5.dta for <x>={3,5,7,9} (4)

09createresults.do
Purpose: Produces all regression tables and graphs
inputs:
	citypanel.dta (08createmaindata.do)
	panel_everpop_final_megacell<x>.dta, x={3,5,7,9} (4; 08createmaindata.do)
	panel_everpop_final_megacell<x>_notopreg60top5.dta, x={3,5,7,9} (4; 08createmaindata.do)
outputs:
	37 tables or table parts
	4 graph figures (A.2, A.4, A.7, A.8)
